[VL] Support mapping columns by position index for ORC and Parquet files #10697
rui-mo merged 1 commit into apache:main
Conversation
rui-mo left a comment:
Thanks for the update. Just one nit, and the other change LGTM.
rui-mo left a comment:
Here's what comes to mind. There are three possible strategies for column mapping:
1. Match by position
2. Match by field name
3. Match by unique permanent ID

I suppose Spark only supports (2) and (3) (see facebookincubator/velox#6065 (comment)), while Velox supports (1) and (2). Could you clarify which is supported in this PR?
@rui-mo This exposes Velox's support for (1).

@kevinwilfong I wonder whether Spark supports matching by position index; I assumed it only supports matching by the file ID or column name. Please correct me if that's wrong.
Vanilla Spark partially supports it (https://issues.apache.org/jira/browse/SPARK-32864) and can be extended to do so by customizing readers (which is what we do).

cc @zhztheplayer If there are no further comments, we can proceed to merge this PR.
@rui-mo Please proceed. Thank you very much for the review.
@kevinwilfong I encountered the issue too. I picked this PR up into my Gluten build, but the problem still exists. Querying this table with Spark SQL produces the output shown below, and I set the configs shown below, but nothing helped. Did I miss something?
@beliefer That's strange. Do you get the same results if you don't set spark.gluten.sql.columnar.backend.velox.orcUseColumnNames to false? Nothing in your repro changes the schema of your data, so it should work regardless of the value of that config.
@kevinwilfong Yes, the results are the same no matter the value of that config.
In that case, as mentioned in the linked issue, I think it's not related to this change.
What changes are proposed in this pull request?
In our data warehouse we support schema evolution by column index rather than by name. For example, if a Hive table has schema a, b, c but a partition's files have schema c, a, b, we don't reorder the partition's columns by name: we read partition column c as table column a, partition column a as table column b, and so on.
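The difference between the two mapping strategies can be sketched in a few lines of plain Python (the schemas and row values here are hypothetical examples, not Gluten/Velox code):

```python
# Sketch: mapping file columns to a table schema by position vs. by name.

def map_by_position(table_schema, file_schema, file_row):
    # The i-th file column feeds the i-th table column; names are ignored.
    return {table_col: file_row[file_col]
            for table_col, file_col in zip(table_schema, file_schema)}

def map_by_name(table_schema, file_schema, file_row):
    # Columns are matched by name; their order in the file is irrelevant.
    return {col: file_row[col] for col in table_schema if col in file_row}

table_schema = ["a", "b", "c"]
file_schema = ["c", "a", "b"]          # partition written with a different order
file_row = {"c": 1, "a": 2, "b": 3}    # values as laid out in the file

print(map_by_position(table_schema, file_schema, file_row))  # {'a': 1, 'b': 2, 'c': 3}
print(map_by_name(table_schema, file_schema, file_row))      # {'a': 2, 'b': 3, 'c': 1}
```

With by-position mapping, table column a receives the value of the first file column (c), which is the schema-evolution behavior this PR exposes.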
This is supported in Velox by setting the configs hive.orc.use-column-names and hive.parquet.use-column-names in the HiveConfig to false for ORC and Parquet files respectively; currently both are hard-coded to true in Gluten. This change adds two configs to Gluten's VeloxConfig, spark.gluten.sql.columnar.backend.velox.orcUseColumnNames and spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames, and plumbs them through to the HiveConfig in Velox.
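For illustration, enabling position-based mapping for both formats might look like this when launching a Spark session (a sketch: the config names come from this PR; passing them via --conf is ordinary Spark usage):

```shell
# Tell the Velox backend to map ORC/Parquet columns by position, not by name
spark-sql \
  --conf spark.gluten.sql.columnar.backend.velox.orcUseColumnNames=false \
  --conf spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames=false
```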
In addition, we need to pass the full table schema to the HiveTableHandle, since that is how Velox determines the index of each column. I updated VeloxIteratorApi to set the FileSchema on the LocalFilesNodes it generates when necessary (i.e., when the config is enabled for the file's format), and VeloxPlanConverter/SubstraitToVeloxPlan to propagate it to the HiveTableHandle when present.
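The reason the full table schema is required can be sketched as follows (pure Python, not the actual Velox code; the schemas are hypothetical): when matching by position, a requested column's index in the table schema is what selects the column to read from the file.

```python
# Sketch: resolving which file column backs each requested table column
# when mapping by position. Without the full table schema, the index
# (and therefore the file column) cannot be determined.

def resolve_file_columns(table_schema, file_schema, requested_cols):
    mapping = {}
    for col in requested_cols:
        idx = table_schema.index(col)      # position comes from the table schema
        mapping[col] = file_schema[idx]    # ...and picks the column in the file
    return mapping

table_schema = ["a", "b", "c"]   # full table schema, needed even for a partial read
file_schema = ["x", "y", "z"]    # file written with entirely different column names
print(resolve_file_columns(table_schema, file_schema, ["b"]))  # {'b': 'y'}
```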
Note that I considered setting it on the ReadRel rather than on each LocalFilesNode. However, that reintroduced the problem that we could no longer read from tables containing column types we don't support, even when we don't read those columns, since we would still need to propagate them to the HiveTableHandle. Because partition file formats don't always match the table file format, we don't know whether we need the schema until we generate the splits, at which point it's too late to update the plan. See #10569.
Please note that vanilla Spark partially supports matching by position index (https://issues.apache.org/jira/browse/SPARK-32864) and can be extended to do so by customizing readers.
How was this patch tested?
Added tests for ORC and Parquet files in which the column names in the table don't match the column names in the file, and verified that we can still read them by index when the flags are enabled.